-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support format_number
#9281
Support format_number
#9281
Conversation
Signed-off-by: Haoyang Li <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only got part way through. I'll finish the review when it is no longer in draft.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
Can you please add the performance test results here? |
} | ||
} | ||
(intPartExp, decPartExp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If execption occurs at line 2371, then intPartExp
will be leaked.
Should handle this resource pair as a whole.
withResource(ArrayBuffer.empty[AutoCloseable]) { resource_array =>
var res1 = genRes()
resource_array ++= res1
var res2 = genRes()
resource_array ++= res2
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated all such cases I can founded.
} | ||
} | ||
(intPart, decPart) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle resource pair as a whole.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started to go through this, but the complexity, especially for the float/double code is rather hard to follow. Also in my own testing I saw a lot of problems with the float/double results. It looks like it is related to rounding and to the amount of precision we can get as output from casting a float to a string using CUDF. I'm not sure if we can fix that without a custom kernel.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More comment for the processing will be better for review.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Show resolved
Hide resolved
Ok, updated them in the PR description. |
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
} | ||
val substrs = closeOnExcept(sepCol) { _ => | ||
(0 until maxstrlen by 3).map { i => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT:
I think we do not need to reverse and reverse back the input string if we slice strings from end to start, something like
var curEndsCol: ColumnVector = strlen
val substrs = withResource(curEndsCol) { _ =>
(0 until maxstrlen by 3).safeMap { _ =>
val startCol = curEndsCol - 3,
val sub = closeOnExcept(startCol) { _ =>
str.substring(startCol, curEndsCol)
}
curEndsCol.close()
curEndsCol = startCol
sub
}.reverse
}
You need to do strlen - 3
in columnar way, e.g.
withResource(Scalar.fromInt(3)) { scalar3 =>
strlen.sub(scalar3)
}
This is just another option, not sure if it would have better perf.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is very close.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
Sorry that this is late, but I am seeing some errors with the latest code for decimal. It looks like the rounding is off in some cases. In the spark shell I ran import org.apache.spark.sql.DataFrame
def compare(left: DataFrame, right: DataFrame): DataFrame = {
val leftCount = left.groupBy(left.columns.map(col(_)): _*).count
val rightCount = right.groupBy(right.columns.map(col(_)): _*).count
val joinOn = leftCount.columns.map(c => leftCount(c) <=> rightCount(c)).reduceLeft(_ and _)
val onlyRight = rightCount.join(leftCount, joinOn, joinType="left_anti").withColumn("_in_column", lit("right"))
val onlyLeft = leftCount.join(rightCount, joinOn, joinType="left_anti").withColumn("_in_column", lit("left"))
onlyRight.union(onlyLeft)
}
spark.conf.set("spark.rapids.sql.enabled", false)
spark.time(spark.range(100000000L).selectExpr("*", "format_number(1 / CAST(id AS DECIMAL(38,0)), 4) as fnid", "1 / CAST(id as DECIMAL(38, 0))").write.mode("overwrite").parquet("/data/tmp/TEST_OUT_CPU"))
spark.conf.set("spark.rapids.sql.enabled", true)
spark.time(spark.range(100000000L).selectExpr("*", "format_number(1 / CAST(id AS DECIMAL(38,0)), 4) as fnid", "1 / CAST(id as DECIMAL(38, 0))").write.mode("overwrite").parquet("/data/tmp/TEST_OUT"))
spark.time(compare(spark.read.parquet("/data/tmp/TEST_OUT"), spark.read.parquet("/data/tmp/TEST_OUT_CPU")).orderBy("id", "_in_column").show(false)) It produced the following
I compared the results to bround, and it looks like we have a bug in there somewhere. I'll file a separate issue for that. Just giving you a heads up. |
val numberToRoundStr = withResource(zeroPointCv) { _ => | ||
withResource(leadingZeros) { _ => | ||
ColumnVector.stringConcatenate(Array(zeroPointCv, leadingZeros, intPart, decPart)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: scalar version is better.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@revans2 Thanks for the review, I think I fixed the memory issues.
I don't think we can fully match Spark's result on the plugin side for float/decimal yet, considering these cuDF issues. This PR can produce correct results with limited precision and aims to require minimal change to fully support double/float when float to string in JNI is ready. |
Signed-off-by: Haoyang Li <[email protected]>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
Outdated
Show resolved
Hide resolved
….scala Co-authored-by: Liangcai Li <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
@@ -3102,6 +3102,12 @@ object GpuOverrides extends Logging { | |||
s" ${RapidsConf.ENABLE_FLOAT_FORMAT_NUMBER} to true.") | |||
} | |||
} | |||
case dt: DecimalType => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once rapidsai/cudf#14210 is fixed we should come back and retest to be sure that it is working properly.
build |
Thanks all for the review and help! merging this... |
Partly supported #9173
This PR support format_number for integral and decimal type. Float/double support is still wip, I plan to refactor this part with a string operations based solution to avoid precision error.
And for float/double type, the results will be mismatch between spark and plugin for large/small numbers.
It is because we should first convert a float/double to string correctly before formatting it, but in plugin casting float/double to string doesn't match Spark/Java's result, see compatibility doc. We may need a custom kernel for float to string casting, see #4204.
The solution is quite long and calls many times of cuDF API, which may make it slower than excepted. I did some performance test, it ran faster than CPU.
performance test results
10000000 random number generated by BigDataGen:
test code: